
Summary

The document reads .ris bibliographic files, filters selected studies, and categorises data sources into Articles, Packages, and Kaggle. Dataset visualisations summarise population, sport type, data type, and geographic distribution. A final evaluation scores all datasets against predefined criteria and compares their suitability for generating synthetic datasets using Statistical and/or GAN-based approaches.

  • Article: Data validated and used in a published paper.
  • Package: Dataset available through a CRAN or Python package.
  • Kaggle: Dataset accessible from the Kaggle.com platform.

Aim

To compile, categorise, and visualise publicly available sports datasets from bibliographic and online sources, and to evaluate them using a scoring framework derived from the literature review to determine which datasets are best suited for Statistical or GAN-based modeling in the development of synthetic datasets.

Approach

  1. Read bibliographic data from .ris files and filter relevant studies through manual and Shiny-based screening.
  2. Create a combined dataset incorporating Articles, Packages, and Kaggle entries.
  3. Generate summary statistics and interactive visualisations:
     - Scatter and bar plots showing sample size, population, sport type, and data type.
     - Global maps and stacked bar plots showing dataset distribution by country.
     - Sankey diagram linking datasets, sports, and collected variables.
  4. Define a scoring system for dataset quality across eight evaluation criteria.
  5. Rank all datasets by total score and visualise the results across Statistical and GAN-based categories.

Results

  • After duplicate removal, 278 bibliographic entries remained; abstract screening reduced these to 89, and the final evaluation covers 50 datasets.
  • Datasets were categorised into Articles, Packages, and Kaggle sources.
  • Summary visualizations described distributions by sample size, population type, sport coverage, and data type.

Conclusions

The analysis shows that GAN-based datasets rely primarily on video and image data focused on athlete performance and activity detection, while Statistical datasets encompass tabular (player and game statistics), physiological, and survey-based data. The evaluation criteria highlight the datasets most suitable for each approach, supporting the selection of appropriate data sources for developing synthetic datasets using Statistical or GAN-based methods.

| Category | Number of Datasets | Data Types | Population | Most Frequent Sports | Top 3 (by Score) |
|:--|:--|:--|:--|:--|:--|
| GAN-based | 16 | Video, Image | Athlete | Multiple, Basketball, Fitness | TeamTrack, C-Sports, SportsMOT |
| Statistical | 34 | Tabular, Physiological, Medical Record, Survey, Accelerometer | Athlete, Multiple | Football, Baseball, Basketball, Fitness | MTS-5, NCAA-ISP, LLBD |

Data Preparation

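# Packages inferred from the functions used in this report
# (the setup chunk is not shown, so this list is an assumption)
library(revtools)      # read_bibliography(), screen_abstracts()
library(qdap)          # strip()
library(dplyr)
library(tidyr)         # unnest()
library(stringr)
library(readxl)        # excel_sheets(), read_excel()
library(plotly)
library(RColorBrewer)
library(tibble)        # tribble()

# Inspect the files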
list.files("data/database/ris/")
## [1] "ebscoSport.ris"     "ieee.ris"           "qut.ris"           
## [4] "scienceDirect.ris"  "springerNature.ris" "wos.ris"
# The Web of Science export causes a read error; remove its empty lines
wos <- readLines("data/database/ris/wos.ris")
wos <- wos[wos != ""]
writeLines(wos, "data/database/ris/wos.ris")
# Read all files as a list and convert to a data frame
files <- list.files("data/database/ris", pattern = "\\.ris$", full.names = TRUE)

bibliography <- read_bibliography(filename = files, return_df = TRUE)
bibliography
# Title preparation: lower-case and strip punctuation and digits
# so that duplicated titles can be matched
bibliography$titleLower <- tolower(bibliography$title)
bibliography$titleLower <- strip(bibliography$titleLower, apostrophe.remove = TRUE)
head(bibliography$titleLower)
## [1] "secondary prevention of musculoskeletal sports injuries a scoping review of early detection and early intervention strategies"                                                
## [2] "the effects of rule changes in footballcode team sports a systematic review"                                                                                                  
## [3] "how physical education teachers are positioned in models scholarship a scoping review"                                                                                        
## [4] "physical education from lgbtq students perspective a systematic review of qualitative studies"                                                                                
## [5] "the altmetric score has a stronger relationship with article citations than journal impact factor and open access status a crosssectional analysis of sport sciences articles"
## [6] "methods of the national collegiate athletic association injury surveillance program â through â"
# Check for duplicates
unique(bibliography$titleLower[duplicated(bibliography$titleLower)])
##  [1] "crosssectional and longitudinal associations of active travel organised sport and physical education with accelerometerassessed moderatetovigorous physical activity in young people the international childrenâs accelerometry database"
##  [2] "match score dataset for team ball sports"                                                                                                                                                                                                
##  [3] "collective sports a multitask dataset for collective activity recognition"                                                                                                                                                               
##  [4] "tgc reid a dataset for sport event reidentification in the wild"                                                                                                                                                                         
##  [5] "regular sports services dataset of demographic frequency and service level agreement"                                                                                                                                                    
##  [6] "aspset an outdoor sports pose video dataset with d keypoint annotations"                                                                                                                                                                 
##  [7] "dataset for the analysis of tv viewer response to live sport broadcasts and sponsor messages"                                                                                                                                            
##  [8] "sports work strategy of college counselors based on mysql database big data analysis"                                                                                                                                                    
##  [9] "epidemiology of testicular trauma in sports analysis of the national electronic injury surveillance system database"                                                                                                                     
## [10] "administrative databases used for sports medicine research demonstrate significant differences in underlying patient demographics and resulting surgical trends"                                                                         
## [11] "analysis of research trends on elbow pain in overhead sports a bibliometric study based on web of science database and vosviewer"                                                                                                        
## [12] "the racial and sexual differences in emergency department visits for sportrelated spine fracture injuries a neiss database study"                                                                                                        
## [13] "comprehensive dataset on presarscov infection sportsrelated physical activity levels disease severity and treatment outcomes insights and implications for covid management"                                                             
## [14] "analysis of a comprehensive dataset influence of vaccination profile types and severe acute respiratory syndrome coronavirus reinfections on changes in sportsrelated physical activity one month after infection"
# Remove duplicated titles, keeping the first unique entry
bibliography <- bibliography[!duplicated(bibliography$titleLower), ]

# Check that duplicates are gone
any(duplicated(bibliography$titleLower))
## [1] FALSE
dim(bibliography)
## [1] 278 104
# Screen the abstracts interactively with the Shiny app (run manually)
# screen_abstracts(bibliography)

Bibliography

The dataset is filtered to keep only the articles selected during abstract screening, reducing the number of entries from 278 to 89.

bibliographyRev <- read.csv("data/database/bibliography/bibliographyRev.csv")
bibliographyRev <- bibliographyRev %>%
                   filter(screened_abstracts == "selected") %>%
                   dplyr::select(author, title, year, keywords, abstract, doi, titlelower,
                                 filename)

# write.csv(bibliographyRev, "data/database/bibliography/bibliographyRevSelected.csv",
#           row.names = FALSE)

dim(bibliographyRev)
## [1] 89  8
colnames(bibliographyRev)
## [1] "author"     "title"      "year"       "keywords"   "abstract"  
## [6] "doi"        "titlelower" "filename"

Database

From the output file above, an Excel file was created manually to categorise the databases into Articles (sheet = databaseAR), Packages (R and Python) (sheet = databasePA), and Kaggle (sheet = databaseOT).

  1. Articles: Databases were searched using the keywords “sport” AND “database” or “sport” AND “dataset” for publicly available datasets.

  2. Packages: Actively maintained packages containing datasets related to athletes were included.

  3. Kaggle: In the datasets category, the keywords used were “injuries”, “sport”, “NFL”, and “AFL”. In the competitions category, only “sport” was used. For both categories, only the top 10 datasets were reviewed.

# List all the sheets 
excel_sheets("data/database/bibliography/bibliographyRevSelected.xlsx")
## [1] "bibliographyRevSelected" "databaseAR"             
## [3] "databasePA"              "databaseOT"             
## [5] "database"                "summary"                
## [7] "rank"

The database sheet contains the merged data from all files, and the summary sheet will be used to generate insights and visualisations.
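
The merge behind the database sheet was done manually in Excel. As a rough sketch of the same step in R, assuming the three source sheets share column names, the rows could be stacked with `dplyr::bind_rows()`. The toy data frames, their values, and the `sourceSheet` column are hypothetical stand-ins for the real sheets.

```r
library(dplyr)

# Toy stand-ins for the three source sheets (hypothetical values)
databaseAR <- data.frame(DatasetName = c("A1", "A2"),
                         SportType = c("Football", "Multiple"))
databasePA <- data.frame(DatasetName = "P1", SportType = "Baseball")
databaseOT <- data.frame(DatasetName = "K1", SportType = "Basketball")

# Stack the sheets and record which sheet each row came from
databaseSketch <- bind_rows(
  databaseAR = databaseAR,
  databasePA = databasePA,
  databaseOT = databaseOT,
  .id = "sourceSheet"
)

databaseSketch
```

Keeping the originating sheet as a column makes it possible to trace each merged row back to its source category.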

Summary

# Read the summary sheet
summary <- read_excel("data/database/bibliography/bibliographyRevSelected.xlsx", 
                           sheet = "summary")

summary

General Analysis of the datasets

colorPalette <- RColorBrewer::brewer.pal(8, "Set2")

f1 <- plot_ly(summary,
              x = ~PopulationType, y = ~SampleOverall,
              type = 'scatter', mode = 'markers',
              color = ~PopulationType, colors = colorPalette,
              size = ~SampleOverall, sizes = c(10, 60),
              marker = list(opacity = 0.7, line = list(width = 1, color = '#333')),
              hoverinfo = 'text',
              text = ~paste('Dataset:', DatasetName,
                            '<br>Samples:', SampleOverall,
                            '<br>Population:', PopulationType),
              showlegend = FALSE)

f2 <- plot_ly(summary %>% count(SportType),
                x = ~SportType, y = ~n, type = 'bar',
                color = ~SportType, colors = colorPalette,
                showlegend = FALSE) 

f3 <- plot_ly(summary %>% count(DataTypeRaw),
                x = ~n, y = ~reorder(DataTypeRaw, n),
                type = 'bar', orientation = 'h',
                color = ~DataTypeRaw, colors = colorPalette,
                showlegend = FALSE) 

f4 <- plot_ly(summary,
              x = ~DataTypeRaw, y = ~SampleOverall,
              type = 'scatter', mode = 'markers',
              color = ~ValidData, colors = c('#E15759', '#59A14F'),
              size = ~SampleOverall, sizes = c(10, 50),
              marker = list(opacity = 0.7),
              hoverinfo = 'text',
              text = ~paste('Dataset:', DatasetName,
                            '<br>Type:', DataTypeRaw,
                            '<br>Valid:', ValidData,
                            '<br>Samples:', SampleOverall)) 

fig <- subplot(f1, f2, f3, f4, nrows = 2, margin = 0.20) %>%
  layout(
    plot_bgcolor = "rgba(0,0,0,0)",
    paper_bgcolor = "rgba(0,0,0,0)",
    showlegend = TRUE,
    legend = list(orientation = "h", x = 0.55, y = -0.15),
    annotations = list(
      list(text = "Sample Size by Population Type", 
           x = 0.20, y = 1.05, showarrow = FALSE, 
           xref='paper', yref='paper', font=list(size=14)),
      list(text = "Datasets by Sport Type", 
           x = 0.80, y = 1.05, showarrow = FALSE, 
           xref='paper', yref='paper', font=list(size=14)),
      list(text = "Data Type Distribution", x = 0.20, y = 0.47, 
           showarrow = FALSE, xref='paper', yref='paper', font=list(size=14)),
      list(text = "Sample Size vs Data Type (by Validity)", 
           x = 0.80, y = 0.47, showarrow = FALSE, xref='paper',
           yref='paper', font=list(size=14))
    )
  )

fig

Analysis of datasets by country

# Split the comma-separated Country column into one row per country.
# Datasets covering multiple countries will therefore appear in multiple rows
summaryMap <- summary %>%
  mutate(Country = str_split(Country, ",")) %>%
  unnest(Country) %>%
  mutate(Country = str_trim(Country))

summaryMap
# Generate the information to display in the map
countrySummary <- summaryMap %>%
  group_by(Country) %>%
  summarise(
    nDatasets = n(),
    datasets = paste(unique(column), collapse = "; "),
    studyDesigns = paste(unique(StudyDesign), collapse = "; "),
    # Use the same numeric column for both ends of the range
    sampleRange = paste0("Min: ", min(SampleOverall, na.rm = TRUE),
                         " | Max: ", max(SampleOverall, na.rm = TRUE)),
    population = paste(unique(PopulationType), collapse = "; "),
    sex = paste(unique(PopulationSex), collapse = "; "),
    sports = paste(unique(SportType), collapse = "; "),
    reference = paste(unique(ReferenceURL), collapse = "; ")
  )

countrySummary
# Create hover text with the information above
countrySummary <- countrySummary %>%
  mutate(hoverText = paste0(
    "<b>", Country, "</b><br>",
    "Datasets: ", nDatasets, "<br>",
    "Study Design: ", studyDesigns, "<br>",
    "Sample Range: ", sampleRange, "<br>",
    "Population: ", population, "<br>",
    "Sex: ", sex, "<br>",
    "Sports: ", sports, "<br>",
    "Dataset Names: ", datasets, "<br>",
    "Reference: ", reference
  ))

countrySummary

The following map does not display the datasets labelled International (n = 9) or Commonwealth countries (n = 1), as these labels do not correspond to a single country.

# Interactive map
mapP <- plot_ly(
  data = countrySummary,
  type = "choropleth",
  locations = ~Country,
  locationmode = "country names",
  z = ~nDatasets,
  text = ~hoverText,
  hoverinfo = "text",
  colorscale = "Oranges",
  colorbar = list(title = "<b>Number of Datasets</b>")
) %>%
  layout(
    title = "Global Distribution of Public Sports Datasets",
    geo = list(
      showframe = FALSE,
      showcoastlines = TRUE,
      projection = list(type = "Mercator"),
      bgcolor = "rgba(0,0,0,0)",    
      domain = list(x = c(0.25, 1), y = c(0.15, 0.85)), 
      center = list(lon = 5, lat = 20)        
    ),
    plot_bgcolor = "rgba(0,0,0,0)", 
    paper_bgcolor = "rgba(0,0,0,0)"
  )

mapP

Analysis of datasets by type

The next dataframe and plot allow us to visualise the different types of datasets:

  1. Article: Data validated and used in a paper.
  2. Package: Dataset can be extracted from a CRAN or Python package.
  3. Kaggle: Available from the website Kaggle.com.
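
The data frame `barD` used by the plot is not defined in the code shown in this report. A minimal sketch of how such a table could be built, assuming it counts dataset-country rows per source type and that the source type is inferred from `ReferenceURL` in the same way as in the variables analysis later on. The toy data, the placeholder URLs, and the names `toyMap` and `barDSketch` are hypothetical.

```r
library(dplyr)
library(stringr)

# Toy stand-in for summaryMap: one row per dataset-country pair
# (countries and URLs are placeholders, not real references)
toyMap <- data.frame(
  Country = c("Australia", "Australia", "USA", "USA", "USA"),
  ReferenceURL = c("https://example.org/article-1",
                   "https://www.kaggle.com/example-1",
                   "https://cran.r-project.org/example",
                   "https://example.org/article-2",
                   "https://www.kaggle.com/example-2")
)

# Classify each reference, then count rows per country and source type
barDSketch <- toyMap %>%
  mutate(
    sourceType = case_when(
      str_detect(ReferenceURL, regex("Kaggle", ignore_case = TRUE)) ~ "Kaggle",
      str_detect(ReferenceURL, regex("CRAN|Python", ignore_case = TRUE)) ~ "Package",
      TRUE ~ "Article"
    )
  ) %>%
  count(Country, sourceType, name = "nRefs")

barDSketch
```

The result has the three columns the bar plot expects (`Country`, `sourceType`, `nRefs`), with one row per country and source type.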
# Generate the plot
barP <- plot_ly(
  data = barD,
  x = ~nRefs,
  y = ~Country,
  type = "bar",
  orientation = "h",
  color = ~sourceType,
  colors = c("#87CEEB", "#F4A261", "grey"),
  text = ~paste0(nRefs, " ", sourceType),
  textposition = "inside",  
  insidetextanchor = "middle",
  textfont = list(color = "black", size = 9, family = "Arial"),  
  hovertext = ~paste0(Country, ": ", nRefs, " ", sourceType, " datasets"),
  hoverinfo = "text",
  showlegend = TRUE
) %>%
  layout(
    barmode = "stack",
    title = "Datasets per Country by Source Type",
    xaxis = list(title = "Count", showgrid = FALSE),
    yaxis = list(title = "", automargin = TRUE),
    legend = list(
      title = list(text = "<b>Source Type</b>"),
      orientation = "h",            
      x = 0.82, y = 0.05,           
      bgcolor = "rgba(0,0,0,0)",    
      bordercolor = "rgba(0,0,0,0)" 
    ),
    plot_bgcolor = "rgba(0,0,0,0)",
    paper_bgcolor = "rgba(0,0,0,0)"
  )

barP
# Combine plots
fp <- subplot(
  mapP, barP,
  nrows = 2, shareX = FALSE,
  heights = c(0.50, 0.50),
  margin = 0.8
) %>%
  layout(
    title = "Global Distribution of Public Sports Datasets"
  )

fp 

Analysis of Variables by Dataset

# Prepare the dataset selecting the relevant columns
variables <- summary %>%
  select(column, SportType, DataTypeRaw, VariablesCollected, ReferenceURL) %>%
  mutate(
    sourceType = case_when(
      str_detect(ReferenceURL, regex("Kaggle", ignore_case = TRUE)) ~ "Kaggle",
      str_detect(ReferenceURL, regex("CRAN|Python", ignore_case = TRUE)) ~ "Package",
      TRUE ~ "Article"
    ),
    VariablesCollected = str_replace_all(
      VariablesCollected,
      regex("•\\s*", ignore_case = TRUE),
      "<br>• "
    ),
    VariablesCollected = paste0("<b>Variables:</b>", VariablesCollected)
  ) 

variables

The following plot links three sections:

  1. Datasets on the left
  2. Sports in the middle
  3. Variables on the right

Each flow represents a connection between these sections and is colored by its data source type (Kaggle-Orange, Package-Green, or Article-Blue). Move the mouse over a flow to see the type of variables included in that connection.
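
A plotly Sankey diagram identifies nodes by zero-based position, so each label has to be translated into an index before links can be drawn. The small example below illustrates that mapping on toy labels; `toyNodes` and `toyIndex` are hypothetical names used only for this sketch.

```r
# Toy node labels: two datasets, two sports, one data type (hypothetical values)
toyNodes <- data.frame(
  name = unique(c("DatasetA", "DatasetB", "Football", "Basketball", "Tabular"))
)

# match() returns 1-based positions; subtract 1 for the 0-based index
# that the Sankey link source/target fields use
toyIndex <- function(x) match(x, toyNodes$name) - 1

toyIndex(c("DatasetA", "Football", "Tabular"))
## [1] 0 2 4
```

Because the node list is built with `unique()`, a label that appears in more than one section (for example a dataset named after its sport) would collapse into a single node, which is worth checking when reading the diagram.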

# Create Node List 
nodes <- data.frame(
  name = unique(c(variables$column, variables$SportType, variables$DataTypeRaw))
)

# Function to map each label to numeric index
get_index <- function(x) match(x, nodes$name) - 1

# Links
links <- bind_rows(
  variables %>%
    transmute(
      source = get_index(column),
      target = get_index(SportType),
      type = sourceType,
      hover = VariablesCollected
    ),
  variables %>%
    transmute(
      source = get_index(SportType),
      target = get_index(DataTypeRaw),
      type = sourceType,
      hover = VariablesCollected
    )
)

color_map <- c(
  "Kaggle" = "#FFB347",
  "Package" = "#77DD77",
  "Article" = "#779ECB"
)
links$color <- color_map[links$type]

# Plotly Sankey 
fig <- plot_ly(
  type = "sankey",
  arrangement = "snap",
  node = list(
    label = nodes$name,
    color = "grey",
    pad = 15,
    thickness = 20,
    line = list(color = "black", width = 0.5)
  ),
  link = list(
    source = links$source,
    target = links$target,
    value = rep(1, nrow(links)),          
    color = links$color,
    customdata = links$hover,           
    hovertemplate = "%{customdata}<extra></extra>"
  )) 

fig <- fig %>%
  layout(
    title = list(
      text = "Variables Across Sports Datasets",
      font = list(size = 18, color = "#333", family = "Roboto")
    ),
    font = list(size = 12),
    margin = list(l = 10, r = 10, t = 60, b = 10),
    annotations = list(
      list(
        x = 0.00, y = 1.05,
        text = "<b>Datasets</b>",
        showarrow = FALSE,
        xref = "paper", yref = "paper",
        font = list(size = 14, color = "#FFB347", family = "Roboto")
      ),
      list(
        x = 0.50, y = 0.76,
        text = "<b>Sports</b>",
        showarrow = FALSE,
        xref = "paper", yref = "paper",
        font = list(size = 14, color = "#77DD77", family = "Roboto")
      ),
      list(
        x = 0.95, y = 0.75,
        text = "<b>Variables</b>",
        showarrow = FALSE,
        xref = "paper", yref = "paper",
        font = list(size = 14, color = "#779ECB", family = "Roboto")
      )
    )
  )

fig

Datasets

Each dataset was then analysed and scored manually against the criteria in the following table:

criteria <- tribble(
  ~Criterion, ~Score0, ~Score1, ~Score2, ~Score3, ~Score4, ~Score5, ~Why_it_matters, ~Weight,
  "PopulationType", "Undefined or unclear", "Mentioned but not specified", "Defined but non-specific", "Clearly defined target group", "Multiple subgroups", "Representative population", "Scope & external validity", "10%",
  "SampleRaw", "<100", "100–499", "500–999", "1k–9k", "10k–99k", "≥100k or continuous", "Statistical power / stability", "15%",
  "SportType", "Not stated", "Unclear type", "1 sport", "2–3 sports", "4–6 sports", "Multi-sport", "Cross-sport generalisability", "10%",
  "DataTypeRaw", "Derived only", "Simple data", "Data + derived", "Tabular + image/video", "Tabular + image + video", "Tabular + image + video + time-series", "Synthetic realism potential", "20%",
  "VariablesCollected", "Few", "Single data", "Metrics + demographics", "Metrics + player + game stats", "Metrics + player + game + context", "Metrics + player + game + context + metadata", "Modeling depth & richness", "30%",
  "Documentation", "None", "Minimal", "Variable list", "Readme + schema", "Schema + examples", "Full docs + code", "Reusability & reproducibility", "5%",
  "Access", "Payment", "Manual request", "Online request", "Partially open (data)", "Partially open (license)", "Fully open", "Ease of reuse", "5%",
  "Data Cleanliness", "Missing values", "Many issues", "Minor errors", "Clean", "Clean + consistent", "Curated / validated", "Preprocessing quality", "5%"
)

knitr::kable(criteria, align = "l", caption = "Dataset Evaluation Criteria")
Dataset Evaluation Criteria

| Criterion | Score0 | Score1 | Score2 | Score3 | Score4 | Score5 | Why_it_matters | Weight |
|:--|:--|:--|:--|:--|:--|:--|:--|:--|
| PopulationType | Undefined or unclear | Mentioned but not specified | Defined but non-specific | Clearly defined target group | Multiple subgroups | Representative population | Scope & external validity | 10% |
| SampleRaw | <100 | 100–499 | 500–999 | 1k–9k | 10k–99k | ≥100k or continuous | Statistical power / stability | 15% |
| SportType | Not stated | Unclear type | 1 sport | 2–3 sports | 4–6 sports | Multi-sport | Cross-sport generalisability | 10% |
| DataTypeRaw | Derived only | Simple data | Data + derived | Tabular + image/video | Tabular + image + video | Tabular + image + video + time-series | Synthetic realism potential | 20% |
| VariablesCollected | Few | Single data | Metrics + demographics | Metrics + player + game stats | Metrics + player + game + context | Metrics + player + game + context + metadata | Modeling depth & richness | 30% |
| Documentation | None | Minimal | Variable list | Readme + schema | Schema + examples | Full docs + code | Reusability & reproducibility | 5% |
| Access | Payment | Manual request | Online request | Partially open (data) | Partially open (license) | Fully open | Ease of reuse | 5% |
| Data Cleanliness | Missing values | Many issues | Minor errors | Clean | Clean + consistent | Curated / validated | Preprocessing quality | 5% |

The rank sheet was added to the main file to store the scores. Two columns were created manually: TotalScore, the score assigned to each dataset, and literatureCategory, the category assigned by the literature review (GAN-based or Statistical).
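
The report does not state how the per-criterion scores were combined into TotalScore. One plausible reading, sketched here purely as an assumption, is a weighted sum of the 0 to 5 scores using the weights from the criteria table. The example scores and the names `weights`, `scores`, and `totalScoreSketch` are hypothetical.

```r
# Weights from the criteria table, expressed as proportions (they sum to 1)
weights <- c(PopulationType = 0.10, SampleRaw = 0.15, SportType = 0.10,
             DataTypeRaw = 0.20, VariablesCollected = 0.30,
             Documentation = 0.05, Access = 0.05, DataCleanliness = 0.05)

# Hypothetical 0-5 scores for a single dataset, in the same order as the weights
scores <- c(PopulationType = 3, SampleRaw = 4, SportType = 2,
            DataTypeRaw = 3, VariablesCollected = 4,
            Documentation = 5, Access = 5, DataCleanliness = 3)

# Guard against a mis-specified weight vector or mismatched criteria
stopifnot(isTRUE(all.equal(sum(weights), 1)),
          identical(names(weights), names(scores)))

# Weighted total, on the same 0-5 scale as the individual scores
totalScoreSketch <- sum(scores * weights)
totalScoreSketch
```

Under this scheme VariablesCollected and DataTypeRaw together account for half of the total, so datasets with rich variables and varied data types rank highest.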

# Select the variables of interest
summaryScore <- summary %>%
  select(column, TotalScore, literatureCategory, ValidData, PopulationType,
         SportType, DataTypeRaw)

summaryScore
# Rename the columns 
summaryScore <- summaryScore %>%
  rename(dataset = column, 
         group = literatureCategory, 
         value = TotalScore) %>% 
  mutate(group = as.factor(group)) %>%
  arrange(group, desc(value))
# Create two dataframes to separate plots. Plots will have hover with each dataset info
statsD <- filter(summaryScore, group == "Statistical")
ganD <- filter(summaryScore, group == "GAN-based")

# One colour per data type, interpolated from Set2 (brewer.pal needs 3 to 8 colours)
dataTypes <- unique(summaryScore$DataTypeRaw)
colorPalette <- setNames(
  colorRampPalette(
    RColorBrewer::brewer.pal(min(max(length(dataTypes), 3), 8), "Set2")
  )(length(dataTypes)),
  dataTypes)

stat <- plot_ly(statsD,
                x = ~value,
                y = ~reorder(dataset, value),
                type = 'bar',
                orientation = 'h',
                color = ~DataTypeRaw, 
                colors = colorPalette,
                hoverinfo = 'text',
                marker = list(line = list(width = 1.5)),
                text = ~paste(
                  "<b>Dataset:</b>", dataset,
                  "<br><b>Value:</b>", round(value, 3),
                  "<br><b>ValidData:</b>", ValidData,
                  "<br><b>Population:</b>", PopulationType,
                  "<br><b>Sport:</b>", SportType,
                  "<br><b>Data Type:</b>", DataTypeRaw
                )) %>%
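  layout(title = "",
         xaxis = list(title = ""),
         # automargin is an axis attribute, so it is set inside yaxis;
         # tickmode = "array" is dropped because no tickvals are supplied
         yaxis = list(title = "Statistical", automargin = TRUE))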

gan <- plot_ly(ganD,
               x = ~value,
               y = ~reorder(dataset, value),
               type = 'bar',
               orientation = 'h',
               color = ~DataTypeRaw,
               colors = colorPalette,
               hoverinfo = 'text',
               marker = list(line = list(width = 1.5)),
               text = ~paste(
                 "<b>Dataset:</b>", dataset,
                 "<br><b>Value:</b>", round(value, 3),
                 "<br><b>ValidData:</b>", ValidData,
                 "<br><b>Population:</b>", PopulationType,
                 "<br><b>Sport:</b>", SportType,
                 "<br><b>Data Type:</b>", DataTypeRaw
               )) %>%
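  layout(title = "",
         xaxis = list(title = ""),
         # automargin is an axis attribute, so it is set inside yaxis;
         # tickmode = "array" is dropped because no tickvals are supplied
         yaxis = list(title = "GAN-based", automargin = TRUE))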

subplot(stat, gan, nrows = 2, shareX = FALSE, titleY = TRUE) %>%
  layout(title = "Ranking of datasets by Approach (Statistical vs GAN-based Approaches)",
         showlegend = TRUE)